Cleanup race condition in daemon reports #1402

Merged
merged 1 commit into from
Aug 3, 2022

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Aug 3, 2022

When prterun runs on a node whose topology differs from
that of the other nodes AND daemon rank=1 is delayed in
sending its callback message, so that one or more other
daemons report first, we segfault because:

  • the first daemon to report has its signature recorded
    and is immediately asked to return its topo

  • subsequent daemons with the SAME signature then try
    to use the NULL topo from the topologies array to
    define their available CPUs

Resolve this by caching any daemons that report before
rank=1 so that their topo can be compared against the
one from rank=1.
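
The deferral idea described above can be modelled with a small sketch. This is an illustrative Python toy, not the PRRTE C code touched by this PR: the `Tracker` class, its field names, and the assumption that rank=1 delivers its topology together with its report are all hypothetical and exist only to show the ordering logic.

```python
from dataclasses import dataclass, field


@dataclass
class Tracker:
    """Toy model of the bookkeeping; all names here are hypothetical."""
    topologies: dict = field(default_factory=dict)      # signature -> topology
    deferred: list = field(default_factory=list)        # reports seen before rank 1
    needs_topology: list = field(default_factory=list)  # ranks we must query
    cpus: dict = field(default_factory=dict)            # rank -> usable CPU count
    rank1_seen: bool = False

    def report(self, rank, signature, topology=None):
        # Any daemon that reports before rank 1 is cached, not processed,
        # so nothing can dereference a topology that has not arrived yet.
        if rank != 1 and not self.rank1_seen:
            self.deferred.append((rank, signature))
            return
        if rank == 1:
            # Assumption: rank 1 carries its topology in its report.
            self.rank1_seen = True
            self.topologies[signature] = topology
            self._assign(rank, signature)
            # Replay the cached reports now that there is a topology
            # to compare against.
            pending, self.deferred = self.deferred, []
            for cached_rank, cached_sig in pending:
                self._assign(cached_rank, cached_sig)
            return
        self._assign(rank, signature)

    def _assign(self, rank, signature):
        topo = self.topologies.get(signature)
        if topo is None:
            # Unknown signature: ask this daemon for its topology
            # instead of using a missing entry.
            self.needs_topology.append(rank)
            return
        self.cpus[rank] = topo["cpus"]


tracker = Tracker()
tracker.report(2, "sigA")               # arrives early, gets cached
tracker.report(1, "sigA", {"cpus": 8})  # rank 1 arrives, cache is replayed
```

In this model a daemon whose signature does not match any known topology is queued for an explicit topology request rather than being matched against an empty slot, which is the failure mode the description attributes to the original ordering.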

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit fc83ca4)

@rhc54 rhc54 merged commit 5538e7e into openpmix:v3.0 Aug 3, 2022
@rhc54 rhc54 deleted the cmr30/up branch August 3, 2022 14:29